synthesizing audio
Synthesizing Audio from Silent Video using Sequence to Sequence Modeling
Belinchon, Hugo Garrido-Lestache; Mulugeta, Helina; Haile, Adam
Generating audio from a video's visual context has several practical applications in how we interact with audio-visual media, for example enhancing CCTV footage analysis, restoring historical videos (e.g., silent movies), and improving video generation models. We propose a novel method to generate audio from video using a sequence-to-sequence model, improving on prior work that used CNNs and WaveNet and struggled with sound diversity and generalization. Our approach employs a 3D Vector Quantized Variational Autoencoder (VQ-VAE) to capture the video's spatial and temporal structure, and decodes the result with a custom audio decoder to produce a broader range of sounds. The model is trained on a segment of the YouTube-8M dataset restricted to specific domains, and targets the applications named above: CCTV footage analysis, silent movie restoration, and video generation models.
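To make the vector-quantization step of a VQ-VAE concrete, here is a minimal sketch of the codebook lookup applied to a 3D (time, height, width) grid of latent vectors. It is an illustration only, not the authors' implementation: the function names, the toy codebook, and the grid size are all assumptions, and a real model would learn the codebook and run on tensors rather than nested lists.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))


def quantize(latents, codebook):
    """Replace every latent vector in a (T, H, W) grid with its codebook index."""
    return [
        [[nearest_code(vec, codebook) for vec in row] for row in frame]
        for frame in latents
    ]


# Hypothetical toy data: 3 codebook entries of dimension 2,
# and a latent grid with T=2 frames, H=1 row, W=2 columns.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [
    [[[0.1, -0.2], [0.9, 1.2]]],
    [[[-0.8, 0.7], [0.4, 0.4]]],
]

indices = quantize(latents, codebook)
print(indices)  # [[[0, 1]], [[2, 0]]]
```

The resulting grid of discrete indices is the kind of token sequence a sequence-to-sequence decoder could consume, which is presumably why quantization pairs well with the approach described above.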
[R] WaveGAN: Synthesizing Audio with Generative Adversarial Networks • r/MachineLearning
I don't see why you're so eager to bash this so hard. Most GAN papers work on 128x128 images, which is about the number of samples in 1s of audio, and even with the cleverest tricks so far, like LAPGAN or PGGAN, the best is about 1024x1024 images. This is the very first published GAN model that is successfully trained with 1-D convolutions without skip connections, which means it can generate audio samples in a completely unsupervised fashion, directly from latent samples. Can you imagine the new possibilities for generative audio modeling stemming from this, like what people did with images over the last couple of years? Also, people created videos from frames obtained from CycleGAN, and they didn't linearly scale everything like you like to do so much.
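A quick sanity check of the size comparison in the comment above, as a rough sketch. The 16 kHz sample rate is an assumption (the comment does not state one), and the helper name is made up for illustration.

```python
SAMPLE_RATE_HZ = 16_000  # assumed audio sample rate, not stated in the comment


def seconds_of_audio(side, sample_rate=SAMPLE_RATE_HZ):
    """Duration of audio with as many samples as a side x side image has pixels."""
    return side * side / sample_rate


print(128 * 128)               # 16384 values in a 128x128 image
print(seconds_of_audio(128))   # 1.024 seconds at the assumed rate
print(seconds_of_audio(1024))  # 65.536 seconds at the assumed rate
```

Under that assumption, a 128x128 image and roughly one second of audio do hold a similar number of values, which is the comparison the comment is making.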